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We consider the problem of estimating the graph associated with 
a binary Ising Markov random field. We describe a method based 
on ^-regularized logistic regression, in which the neighborhood of 
any given node is estimated by performing logistic regression subject 
to an f i-constraint. The method is analyzed under high-dimensional 
scaling in which both the number of nodes p and maximum neigh- 
borhood size d are allowed to grow as a function of the number of 
observations n. Our main results provide sufficient conditions on the 
triple (n,p, d) and the model parameters for the method to succeed in 
consistently estimating the neighborhood of every node in the graph 
simultaneously. With coherence conditions imposed on the popula- 
tion Fisher information matrix, we prove that consistent neighbor- 
hood selection can be obtained for sample sizes n = Q(d 3 logp) with 
exponentially decaying error. When these same conditions are im- 
posed directly on the sample matrices, we show that a reduced sample 
size of n — f2(d 2 logp) suffices for the method to estimate neighbor- 
hoods consistently. Although this paper focuses on the binary graph- 
ical models, we indicate how a generalization of the method of the 
paper would apply to general discrete Markov random fields. 

1. Introduction. Undirected graphical models, also known as Markov 
random fields, are used in a variety of domains, including statistical physics 
[17], natural language processing [21], image analysis [8, 14, 37] and spatial 
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statistics [26], among others. A Markov random field (MRF) is specified by 
an undirected graph G = (V, E) with vertex set V = {1, 2, . . . ,p} and edge set 
E C V x V. The structure of this graph encodes certain conditional indepen- 
dence assumptions among subsets of the p-dimensional discrete random vari- 
able X = (X\,X2, • • • , X p ) where variable Xi is associated with vertex i E V . 
One important problem for such models is to estimate the underlying graph 
from n independent and identically distributed samples {x^\x^ 2 \ . . . ,x^} 
drawn from the distribution specified by some Markov random field. As a 
concrete illustration, for binary random variables, each vector-valued sam- 
ple £ {0, 1} P might correspond to the votes of a set of p politicians on 
a particular bill, and estimating the graph structure amounts to detecting 
statistical dependencies in these voting patterns (see Banerjee, Ghaoui and 
d'Aspremont [2] for further discussion of this example). 

Due to both its importance and difficulty, the problem of structure learn- 
ing for discrete graphical models has attracted considerable attention. The 
absence of an edge in a graphical model encodes a conditional independence 
assumption. Constraint-based approaches [30] estimate these conditional in- 
dependencies from the data using hypothesis testing and then determine a 
graph that most closely represents those independencies. Each graph repre- 
sents a model class of graphical models; learning a graph then is a model 
class selection problem. Score-based approaches combine a metric for the 
complexity of the graph with a measure of the goodness of fit of the graph 
to the data; for instance, log-likelihood of the maximum likelihood param- 
eters given the graph, to obtain a score for each graph. The score is used 
together with a search procedure that generates candidate graph structures 
to be scored. The number of graph structures grows super-exponentially, 
however, and Chickering [6] shows that this problem is in general NP-hard. 

A complication for undirected graphical models involving discrete random 
variables is that typical score metrics involve the partition function or cu- 
mulant function associated with the Markov random field. For general undi- 
rected MRFs, calculation of this partition function is computationally in- 
tractable [36] . The space of candidate structures in scoring based approaches 
is thus typically restricted to either directed graphical models [10] or to sim- 
ple sub-classes of undirected graphical models such as those based on trees 
[7] and hypertrees [31]. Abbeel, Koller and Ng [1] propose a method for 
learning factor graphs based on local conditional entropies and thresholding 
and analyze its behavior in terms of Kullback-Leibler divergence between 
the fitted and true models. They obtain a sample complexity that grows log- 
arithmically in the number of vertices p, but the computational complexity 
grows at least as quickly as 0(p d+1 ) where d is the maximum neighborhood 
size in the graphical model. This order of complexity arises from the fact 
that for each node, there are (p) = Q{p d ) possible neighborhoods of size d 
for a graph with p vertices. Csiszar and Talata [9] show consistency of a 
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method that uses pseudo-likelihood and a modification of the BIC criterion, 
but this also involves a prohibitively expensive search. 

The main contribution of this paper is a careful analysis of the compu- 
tational and statistical efficiency of a simple method for graphical model 
selection. The basic approach is straightforward: it involves performing l\- 
regularized logistic regression of each variable on the remaining variables, 
and then using the sparsity pattern of the regression vector to infer the un- 
derlying neighborhood structure. Our analysis is high-dimensional in nature, 
meaning that both the model dimension p as well as the maximum neighbor- 
hood size d may tend to infinity as a function of the size n. Our main result 
shows that under mild assumptions on the population Fisher information 
matrix, consistent neighborhood selection is possible using n = Q^cPlogp) 
samples and computational complexity 0(max{n,p}p 3 ). We also show that 
when the same assumptions are imposed directly on the sample matrices, 
n = Q,(d 2 logp) samples suffice for consistent neighborhood selection with 
the same computational complexity. We focus in this paper on binary Ising 
models, but indicate in Section 7 a generalization of the method applicable 
to general discrete Markov random fields. 

The technique of ^i-regularization for estimation of sparse models or sig- 
nals has a long history in many fields (for instance, see [32] for one survey). 
A surge of recent work has shown that £i-regularization can lead to practical 
algorithms with strong theoretical guarantees (e.g., [5, 12, 23, 24, 32, 33, 39]). 
Despite the well-known computational intractability of computing marginals 
and likelihoods for discrete MRFs [36], our method is computationally ef- 
ficient; it involves neither computing the normalization constant (or parti- 
tion function) associated with the Markov random field nor a combinatorial 
search through the space of graph structures. Rather, it requires only the 
solution of standard convex programs with an overall computational com- 
plexity of order 0(max{p, n}p 3 ) and is thus well suited to high-dimensional 
problems [20]. Conceptually, like the work of Meinshausen and Buhlmann 
[23] on covariance selection in Gaussian graphical models, our approach can 
be understood as using a type of pseudo-likelihood based on the local con- 
ditional likelihood at each node. In contrast to the Gaussian case, where the 
exact maximum likelihood estimate can be computed exactly in polynomial 
time, this use of a surrogate loss function is essential for discrete Markov 
random fields given the intractability of computing the exact likelihood [36] . 

Portions of this work were initially reported in a conference publication 
[35], with the weaker result that n = £l(d e log d + d 5 logp) samples suffice 
for consistent Ising model selection. Since the appearance of that paper, 
other researchers have also studied the problem of model selection in dis- 
crete Markov random fields. For the special case of bounded degree models, 
Bresler, Mossel and Sly [4] describe a simple search-based method, and prove 
under relatively mild assumptions that it can recover the graph structure 
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with ©(logp) samples. However, in the absence of additional restrictions, 
the computational complexity of the method is 0(p d+1 ). In other work, 
Santhanam and Wainwright [29] analyze the information-theoretic limits of 
graphical model selection, providing both upper and lower bounds on var- 
ious model selection procedures, but these methods also have prohibitive 
computational costs. 

The remainder of this paper is organized as follows. We begin in Section 2 
with background on discrete graphical models, the model selection problem 
and logistic regression. In Section 3, we state our main result, develop some 
of its consequences and provide a high-level outline of the proof. Section 4 is 
devoted to proving a result under stronger assumptions on the sample Fisher 
information matrix whereas Section 5 provides concentration results linking 
the population matrices to the sample versions. In Section 6, we provide some 
experimental results that illustrate the practical performance of our method 
and the close agreement between theory and practice. Section 7 discusses an 
extension to more general Markov random fields, and we conclude in Section 
8. 

Notation. For the convenience of the reader, we summarize here notation 
to be used throughout the paper. We use the following standard notation 
for asymptotics: we write f(n) = 0{g{n)) if f(n) < Kg{n) for some constant 
K < oo, and f(n) = £l(g(n)) if /(n) > K'g(n) for some constant K' > 0. The 
notation f(n) = @(g(n)) means that f(n) = 0{g{n)) and f(n) = Q(g(n)). 
Given a vector v G M. d and parameter q G [l,oo], we use \\v\\ q to denote 
the usual £ q norm. Given a matrix A G M axb and parameter q G [1, oo], we 
use HI .A HI 5 to denote the induced matrix-operator norm with A viewed as a 
mapping from i b q — > £^ (see Horn and Johnson [16]). Two examples of par- 
ticular importance in this paper are the spectral norm |||A|||2, correspond- 
ing to the maximal singular value of A, and the £oo matrix norm, given by 
|||-A|||oo = m & x j=l,...,a Sfc=ilA?'fcl- We make use of the bound Hl^oo < v^aHAH^ 
for any symmetric matrix A G M axa . 

2. Background and problem formulation. We begin by providing some 
background on Markov random fields, defining the problem of graphical 
model selection and describing our method based on neighborhood logistic 
regression. 

2.1. Pairwise Markov random fields. Let X = (Xi,X2, . . . ,X p ) denote a 
random vector with each variable X s taking values in a corresponding set X s . 
Say we are given an undirected graph G with vertex set V = {1, . . . ,p} and 
edge set E, so that each random variable X s is associated with a vertex s G 
V. The pairwise Markov random field associated with the graph G over the 
random vector X is the family of distributions of X which factorize as ¥(x) oc 
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ex P{S( s ,t)e£;^*( :r s' x *)} where for each edge (s,t) G E, 4> st is a mapping 
from pairs (x s ,xt) G X s x X t to the real line. For models involving discrete 
random variables, the pairwise assumption involves no loss of generality since 
any Markov random field with higher-order interactions can be converted 
(by introducing additional variables) to an equivalent Markov random field 
with purely pairwise interactions (see Wainwright and Jordan [34] for details 
of this procedure). 

Ising model. In this paper, we focus on the special case of the Ising 
model in which X s G { — 1, 1} for each vertex s £V, and (p s t(x s ,xt) = 9* t x s Xt 
for some parameter 9* t G M, so that the distribution takes the form 




The partition function Z{6*) ensures that the distribution sums to one. This 
model is used in many applications of spatial statistics such as modeling the 
behavior of gases or magnets in statistical physics [17], building statistical 
models in computer vision [13] and social network analysis. 

2.2. Graphical model selection. Suppose that we are given a collection 
j£" := {x^ l \ . . . ,x^} of n samples where each p-dimensional vector x^> G 
{ — 1,+1} P is drawn in an i.i.d. manner from a distribution Fq* of the form 
(1) for parameter vector 9* and graph G = (V,E) over the p variables. It 
is convenient to view the parameter vector 9* as a (2) -dimensional vector, 
indexed by pairs of distinct vertices but nonzero if and only if the vertex pair 
(s,t) belongs to the unknown edge set E of the underlying graph G. The 
goal of graphical model selection is to infer the edge set E. In this paper, we 
study the slightly stronger criterion of signed edge recovery; in particular, 
given a graphical model with parameter 6*, we define the edge sign vector 

i 2 ) E * . = J signet); if ( s > £ E , 

\ 0, otherwise. 

Here the sign function takes value +1 if 9* t > 0, value —1 if 9* t < and 0, 
otherwise. Note that the weaker graphical model selection problem amounts 
to recovering the vector \E*\ of absolute values. 

The classical notion of statistical consistency applies to the limiting be- 
havior of an estimation procedure as the sample size n goes to infinity with 
the model size p itself remaining fixed. In many contemporary applications 
of graphical models — among them gene microarray data and social network 
analysis — the model dimension p is comparable to or larger than the sample 
size n, so that the relevance of such "fixed p" asymptotics is limited. With 
this motivation, our analysis in this paper is of the high-dimensional nature, 
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in which both the model dimension and the sample size are allowed to in- 
crease, and we study the scalings under which consistent model selection is 
achievable. 

More precisely, we consider sequences of graphical model selection prob- 
lems, indexed by the sample size n, number of vertices p and maximum node 
degree d. We assume that the sample size n goes to infinity, and both the 
problem dimension p = p(n) and d = d{n) may also scale as a function of 
n. The setting of fixed p or d is covered as a special case. Let E n be an 
estimator of the signed edge pattern E* based on the n samples. Our goal 
is to establish sufficient conditions on the scaling of the triple (n,p,d) such 
that our proposed estimator is consistent in the sense that 

¥[E n = E*] ->• 1 as n -»• +00. 

We sometimes call this property sparsistency, as a shorthand for consistency 
of the sparsity pattern of the parameter 6*. 

2.3. Neighborhood-based logistic regression. Recovering the signed edge 
vector E* of an undirected graph G is equivalent to recovering, for each 
vertex r G V, its neighborhood set M(r) := {t G V \ (r,t) G E} along with the 
correct signs sign(#* 4 ) for all t SjV(r). To capture both the neighborhood 
structure and sign pattern, we define the product set of "signed vertices" as 
{—1,1} x V. We use the shorthand "tr" for elements (t,r) G {—1,1} x V. 
We then define the signed neighborhood set as 

(3) Af ± (r):={ S ign(e* rt )t\te^(r)}. 

Here the sign function has an unambiguous definition, since 9* t 7^ for all 
t G Af(r). Observe that this signed neighborhood set Af±(r) can be recov- 
ered from the sign-sparsity pattern of the (p — l)-dimensional subvector of 
parameters 



V 



{e* ru ,uGV\r}, 



associated with vertex r. In order to estimate this vector 9* r , we consider 
the structure of the conditional distribution of X r given the other variables 
X\ r = {Xt 1 1 G V \ {r}}. A simple calculation shows that under the model 
(1), this conditional distribution takes the form 

N _ exp(2x r E fgVV ^t) 

( j ^^ l ^ j -exp(2x r E* 6 nr^) + l' 

Thus the variable X r can be viewed as the response variable in a logis- 
tic regression in which all of the other variables X\~ play the role of the 
covariates. 
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With this set-up, our method for estimating the sign-sparsity pattern of 
the regression vector 9* r and hence the neighborhood structure A/±(r) is 

based on computing an l\ -regularized logistic regression of X r on the other 
variables X\ r . Explicitly, given 36™ = {x^\x^ 2 \ . . . ,x^}, a set of n i.i.d. 
samples, this regularized regression problem is a convex program of the 
form 

(5) min AW^ + X^^M, 
where 

1 n 

(6) *(0;X?):=--VbgP,(s«|x«) 

n ^— ' v 

i=l 

is the rescaled negative log likelihood (the rescaling factor 1/n in this defi- 
nition is for later theoretical convenience) and A( n pjC ;) > is a regularization 
parameter, to be specified by the user. For notational convenience, we will 
also use A n as notation for this regularization parameter suppressing the 
potential dependence on p and d. 

Following some algebraic manipulation, the regularized negative log like- 
lihood can be written as 



min -V/t^O)- V ^Mru + AnHMi 
v V i=l ueV\r 

where 

(8) f(0;x) := logl expf ^ rt x t J + exp ( - ^ 6 rt x t 

\ev\r ' ^ tev\r 

is a rescaled logistic loss, and ju rM := ^X^i^r i« are empirical moments. 
Note the objective function (7) is convex but not differentiable, due to the 
presence of the ^i-regularizer. By Lagrangian duality, the problem (7) can be 
re-cast as a constrained problem over the ball ||#\ r ||i < C(X n ). Consequently, 
by the Weierstrass theorem, the minimum over 0\ s is always achieved. 

Accordingly, let 6™ r be an element of the minimizing set of problem (7). 

Although 0™ r need not be unique in general since the problem (7) need not 
be strictly convex, our analysis shows that in the regime of interest, this min- 
imizer 9™ r is indeed unique. We use 9™ r to estimate the signed neighborhood 
A/±(r) according to 

(9) A4(r) := {sign(0 ru )tt | u € V \ r, 9 SU ^ 0}. 

We say that the full graph G is estimated consistently, written as the event 
{E n = E*}, if every signed neighborhood is recovered — that is, Af±(r) = 
Af±(r) for all reV. 
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3. Method and theoretical guarantees. Our main result concerns con- 
ditions on the sample size n relative to the parameters of the graphical 
model — more specifically, the number of nodes p and maximum node degree 
d — that ensure that the collection of signed neighborhood estimates (9), one 
for each node r of the graph, agree with the true neighborhoods so that 
the full graph is estimated consistently. In this section, we begin by stating 
the assumptions that underlie our analysis, and then give a precise state- 
ment of the main result. We then provide a high-level overview of the key 
steps involved in its proof, deferring details to later sections. Our analysis 
proceeds by first establishing sufficient conditions for correct signed neigh- 
borhood recovery — that is, {N±{r) = M±(r)} — for some fixed node r G V. 
By showing that this neighborhood consistency is achieved at sufficiently 
fast rates, we can then use a union bound over all p nodes of the graph to 
conclude that consistent graph selection is also achieved. 



3.1. Assumptions. Success of our method requires certain assumptions 
on the structure of the logistic regression problem. These assumptions are 
stated in terms of the Hessian of the likelihood function EjlogP^A'j. | Xy]} 
as evaluated at the true model parameter 6* r £ MP" 1 . More specifically, for 
any fixed node r S V, this Hessian is a (p — 1) x (p — 1) matrix of the form 

(10) Q* r := E,*{V 2 logP e . [X r | X v }}. 

For future reference, this is given as the explicit expression 

(11) Q* r =E e *[ V (X;e*)X v xT], 
where 

4exp(2f/ r 52tev\r ^rtu t ) 



(12) v(u;6):-- 



(exp(2u r ^2tev\r Qrtut) + 1)' 



is the variance function. Note that the matrix Q* is the Fisher informa- 
tion matrix associated with the local conditional probability distribution. 
Intuitively, it serves as the counterpart for discrete graphical models of the 
covariance matrix E[XX T ] of Gaussian graphical models, and indeed our 
assumptions are analogous to those imposed in previous work on the Lasso 
for Gaussian linear regression [23, 32, 39]. 

In the following we write simply Q* for the matrix Q* where the reference 
node r should be understood implicitly. Moreover, we use S := {(r, t) 1 1 € 
M(r)} to denote the subset of indices associated with edges of r, and S c to 
denote its complement. We use Q* ss to denote the d x d sub-matrix of Q* 
indexed by 5. With this notation, we state our assumptions: 
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(Al) Dependency condition. The subset of the Fisher information ma- 
trix corresponding to the relevant covariates has bounded eigenvalues; that 
is, there exists a constant C mm > such that 

(13) A m i n (Q* ss ) > C m ; n . 

Moreover, we require that A max (¥,$*[X\ r X^ r ]) < D max . These conditions en- 
sure that the relevant covariates do not become overly dependent. (As stated 
earlier, we have suppressed notational dependence on r; thus these condi- 
tions are assumed to hold for each r G V.) 



( A2) Incoherence condition. Our next assumption captures the intuition 
that the large number of irrelevant covariates (i.e., nonneighbors of node r) 
cannot exert an overly strong effect on the subset of relevant covariates (i.e., 
neighbors of node r). To formalize this intuition, we require the existence of 
an q G (0, 1] such that 

(14) IQsesWssT^oo <!-<*. 



3.2. Statement of main result. We are now ready to state our main result 
on the performance of neighborhood logistic regression for graphical model 
selection. Naturally, the limits of model selection are determined by the 
minimum value over the parameters 9* t for pairs (r,t) included in the edge 
set of the true graph. Accordingly, we define the parameter 

(15) 0^ in = min \9* t \. 

{r,i)eE 

With this definition, we have the following: 



Theorem 1. Consider an Ising graphical model with parameter vector 
9* and associated edge set E* such that conditions (Al) and (A2) are satis- 
fied by the population Fisher information matrix Q* , and let 3t" be a set of n 
i.i.d. samples from the model specified by 9* . Suppose that the regularization 
parameter A n is selected to satisfy 

a V n 

Then there exist positive constants L and K, independent of (n,p,d), such 
that if 

(17) n>Ld 3 logp, 

then the following properties hold with probability at least 1 — 2exp(— K\ 2 n n). 
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(a) For each node r G V, the ^-regularized logistic regression (5), given 
data Si™, has a unique solution, and so uniquely specifies a signed neigh- 
borhood j\f± (r) . 

(b) For each r 6 V, the estimated signed neighborhood M± (r) correctly ex- 
cludes all edges not in the true neighborhood. Moreover, it correctly in- 
cludes all edges (r,t) for which \9*. f \ > VdX n ■ 

The theorem not only specifies sufficient conditions but also the probabil- 
ity with which the method recovers the true signed edge-set. This probability 
decays exponentially as a function of A^n which leads naturally to the fol- 
lowing corollary on model selection consistency of the method for a sequence 
of Ising models specified by (n,p(n),d(n)). 

Corollary 1. Consider a sequence of Ising models with graph edge 
sets {E*/ n \} and parameters {6( np d)}> eac ^ °f which satisfies conditions 
(Al) and (A2). For each n, let X™ be a set of n i.i.d. samples from the 
model specified by 9* n p d y and suppose that (n,p(n),d(n)) satisfies the scaling 
condition (17) of Theorem 1. Suppose further that the sequence {A n } of 
regularization parameters satisfies condition (16) and 

(18) A^n^oo 
and the minimum parameter weights satisfy 

(19) mi. v \e\ n>p>d) (r,t)\> -^-Vd\ n 

for sufficiently large n. Then the method is model selection consistent so that 
if -E'p(n) is the graph structure estimated by the method given data X™, then 

F[E p(n) = E* {n) ] -> 1 asn^oo. 

Remarks, (a) It is worth noting that the scaling condition (17) on (n,p, d) 
allows for graphs and sample sizes in the "large p, small n" regime (meaning 
p^> n), as long as the degrees are bounded, or grow at a sufficiently slow 
rate. In particular, one set of sufficient conditions are the scalings 

d = 0{n Cl ) and p = 0{e nC2 ), 3ci+c 2 <l, 

for some constants c\,C2 > 0. Under these scalings, note that we have d? log(p) 
C(n 3ci+C2 ) = o(n), so that condition (17) holds. 

A bit more generally, note that in the regime p>n, the growth condition 
(17) requires that that d = o(p). However, in many practical applications of 
graphical models (e.g., image analysis, social networks), one is interested in 
node degrees d that remain bounded or grow sub-linearly in the graph size 
so that this condition is not unreasonable. 
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(b) Loosely stated, the theorem requires that the edge weights are not too 
close to zero (in absolute value) for the method to estimate the true graph. 
In particular, conditions (16) and (19) imply that the minimum edge weight 
9* nin is required to scale as 



Note that in the classical fixed (p, d) case, this reduces to the familiar scaling 
requirement of # min = $7(n -1 / 2 ). 

(c) In the high-dimensional setting (for p— > +oo), a choice of the regular- 
ization parameter satisfying both conditions (16) and (18) is, for example, 



for which the probability of incorrect model selection decays at rate 
C(exp(— K' logp)) for some constant K' > 0. In the classical setting (fixed p), 



The analysis required to prove Theorem 1 can be divided naturally into 
two parts. First, in Section 4, we prove a result (stated as Proposition 1) for 
"fixed design" matrices. More precisely, we show that if the dependence con- 
dition (Al) and the mutual incoherence condition (A2) hold for the sample 
Fisher information matrix 



then the growth condition (17) and choice of A n from Theorem 1 are suffi- 
cient to ensure that the graph is recovered with high probability. 

The second part of the analysis, provided in Section 5, is devoted to show- 
ing that under the specified growth condition (17), imposing incoherence and 
dependence assumptions on the population version of the Fisher informa- 
tion Q* guarantees (with high probability) that analogous conditions hold 
for the sample quantities Q n . On one hand, it follows immediately from 
the law of large numbers that the empirical Fisher information Q\ A con- 
verges to the population version Q\ A for any fixed subset A. However, in 
the current setting, the added delicacy is that we are required to control 
this convergence over subsets of increasing size. Our proof therefore requires 
some large-deviation analysis for random matrices with dependent elements 
so as to provide exponential control on the rates of convergence. 






(20) 



1 . 



1=1 
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3.3. Primal-dual witness for graph recovery. At the core of our proof 
lies the notion of a primal-dual witness used in previous work on the Lasso 
[33]. In particular, our proof involves the explicit construction of an opti- 
mal primal- dual pair — namely, a primal solution 9 G W^ 1 along with an 
associated subgradient vector z £ W -1 (which can be interpreted as a dual 
solution), such that the sub-gradient optimality conditions associated with 
the convex program (7) are satisfied. Moreover, we show that under the 
stated assumptions on (n,p, d), the primal-dual pair (#, z) can be constructed 
such that they act as a witness — that is, a certificate guaranteeing that the 
method correctly recovers the graph structure. 

For the convex program (7), the zero sub-gradient optimality conditions 
[27] take the form 

(21) V£(6) + \ n z = 0, 

where the dual or subgradient vector z £ must satisfy the properties 

(22) z r t = sign(0 r () if 0, ^ and \z r t\ < 1 otherwise. 

By convexity, a pair (9,z) £ W" 1 x W 3 " 1 is a primal-dual optimal solution 
to the convex program and its dual if and only if the two conditions (21) 
and (22) are satisfied. Of primary interest to us is the property that such 
an optimal primal-dual pair correctly specifies the signed neighborhood of 
node r; the necessary and sufficient conditions for such correctness are 

(23a) sign(z rt ) = sign(^) V(r, t) £ S := {(r, t) £ E} and 

(23b) 9 ru = for all (r, u) G S c := E \ S. 

The ^i-regularized logistic regression problem (7) is convex; however, for 
p^> n, it need not be strictly convex, so that there may be multiple optimal 
solutions. The following lemma, proved in Appendix A, provides sufficient 
conditions for shared sparsity among optimal solutions, as well as uniqueness 
of the optimal solution: 

Lemma 1. Suppose that there exists an optimal primal solution 9 with 
associated optimal dual vector z such that \\'zs c \\oo < 1- Then any optimal 
primal solution 9 must have 9s c = 0. Moreover, if the Hessian sub-matrix 
\S7 2 (-{9)]ss is strictly positive definite, then 9 is the unique optimal solution. 

Based on this lemma, we construct a primal-dual witness (9,z) with the 
following steps. 

(a) First, we set 9s as the minimizer of the partial penalized likelihood 
(24) 9 S = argmin {£(9; 3EJ) + X n \\9 s \\i} 

and set % = sign(6*s). 
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(b) Second, we set 9$? = so that condition (23b) holds. 

(c) In the third step, we obtain zs c from (21) by substituting in the values 
of 9 and zs- Thus our construction satisfies conditions (23b) and (21). 

(d) The final and most challenging step consists of showing that the stated 
scalings of (n,p,d) imply that, with high-probability, the remaining con- 
ditions (23a) and (22) are satisfied. 

Our analysis in step (d) guarantees that 1 1 1 1 cxd < 1 with high probability. 
Moreover, under the conditions of Theorem 1, we prove that the sub-matrix 
of the sample Fisher information matrix is strictly positive definite with high 
probability so that by Lemma 1, the primal solution 9 is guaranteed to be 
unique. 

It should be noted that, since S is unknown, the primal-dual witness 
method is not a practical algorithm that could ever be implemented to 
solve t\ -regularized logistic regression. Rather, it is a proof technique that 
allows us to establish sign correctness of the unique optimal solution. 

4. Analysis under sample Fisher matrix assumptions. We begin by es- 
tablishing model selection consistency when assumptions are imposed di- 
rectly on the sample Fisher matrix Q n , as opposed to on the population 
matrix Q* , as in Theorem 1. In particular, recalling the definition (20) of 
the sample Fisher information matrix Q n = E[V 2 £(#*)], we define the "good 
event," 

(25) M(3%) := {£? € {-l,+l} nxp | Q n satisfies (Al) and (A2)}. 

As in the statement of Theorem 1, the quantities L and K refer to constants 
independent of (n,p,d). With this notation, we have the following: 

Proposition 1 (Fixed design). If the event M(3P{) holds, the sample 
size satisfies n > Ld 2 log(j>) , and the regularization parameter is chosen such 

that X n > 16 (^~ a ) ^J]£SE ! then with probability at least l — 2exp(—KX 2 l n) — > 1, 
the following properties hold. 

(a) For each node r S V , the i\-regularized logistic regression has a unique 
solution, and so uniquely specifies a signed neighborhood M± (r) . 

(b) For each r £ V , the estimated signed neighborhood vector J\f±(r) cor- 
rectly excludes all edges not in the true neighborhood. Moreover, it correctly 
includes all edges with \9 r A > r , 10 \fd\ n . 

^min 

Loosely stated, this result guarantees that if the sample Fisher informa- 
tion matrix is "good," then the conditional probability of successful graph 
recovery converges to zero at the specified rate. The remainder of this section 
is devoted to the proof of Proposition 1. 
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4.1. Key technical results. We begin with statements of some key techni- 
cal lemmas that are central to our main argument with their proofs deferred 
to Appendix B. The central object is the following expansion obtained by 
re-writing the zero-subgradient condition as 

(26) V£{9; - V£{9* ; X? ) = W n - X n z, 

where we have introduced the short-hand notation W n = — W(#*;X") for 
the (p — l)-dimensional score function, 

W n._ 1 ^ x (i)L(i) e MT,t£V\r *rt4 ) )-eM-T,t£V\r d rt4 l) ) \ 

n ^ x v <^x r exp(E ^ ^ a ,( i)) + exp( _ / ■ 

For future reference, note that Ee*[W n ] = 0. Next, applying the mean-value 
theorem coordinate-wise to the expansion (26) yields 

(27) V 2 £(0*;X%)[9-9*] = W n -\ n z + R n , 
where the remainder term takes the form 

(28) R$ = [V 2 £(0W;3%) - V 2 £(9*;3%)]J(9- 9*) 

with fiki) a parameter vector on the line between 9* and 9, and with [-]J 
denoting the jth row of the matrix. The following lemma addresses the 
behavior of the term W n in this expansion: 

Lemma 2. For the specified mutual incoherence parameter a S (0, 1], we 
have 

(29) pf^||inU>^) <2expf- i " 2A " ^ n + logfr) 



128(2 - a) 2 ' 

which converges to zero at rate exp(— cA 2 n) as long as X n > 
See Appendix B.l for the proof of this claim. 

The following lemma establishes that the sub-vector 9$ is an ^-consistent 
estimate of the true sub- vector 9* s : 

c 2 

Lemma 3 (^-consistency of primal subvector). If X n d < 10 ff in and 
H^loc < A n /4, then 

(30) \\e s -9sh< 7 ^Vd\ n . 
See Appendix B.2 for the proof of this claim. 

Our final technical lemma provides control on the remainder term (28). 
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Lemma 4. // X n d < i 00 g^ ax 2^ and H^Hoc < A n /4, then 

II^IU 25ZW, a 
< —^5 A n d< 



An " C^ in "4(2 -a) 
See Appendix B.3 for the proof of this claim. 

4.2. Proof of Proposition 1. Using these lemmas, the proof of Proposi- 
tion 1 is straightforward. Consider the choice of the regularization param- 
eter, A n = 16^^jp. This choice satisfies the condition of Lemma 2, so 

that we may conclude that with probability greater than 1 — 2 exp(— cA 2 n) — > 
1, we have 

a A A 



V^H < < - 

" 2-«4 - 4 



using the fact that a < 1. The remaining two conditions that we need to 
apply the technical lemmas concern upper bounds on the quantity X n d. In 
particular, for a sample size satisfying n > 10 ^,f )max ^ 2 ~" ^ d 2 \ogp, we have 



A d= 16(2 - a ) . /logp 



< 



a V n 
16C1 a 



mm 



100ZW (2 - a) 

a 2 



^ mm 



WD max 

so that the conditions of both Lemmas 3 and 4 are satisfied. 

We can now proceed to the proof of Proposition 1 . Recalling our shorthand 
Q n = \7q£(6*;X^) and the fact that we have set 6s c = in our primal-dual 
construction, we can re-write condition (27) in block form as 

(31a) Q r ^ cs [9 S - e* s ] = W§ c - X n z S c + R n s * , 

(31b) Qn s[ s _0* ]=W n_ XnSs + R n^ 

Since the matrix Q^g is invertible by assumption, the conditions (31) can 
be re-written as 

(32) Q&siQssr^Wg - X n z s + R n s ) = WBc - X n z S c + R n sc . 
Rearranging yields the condition, 

(33) [W& - R n sc ] - Q^siQssr'm - Rs] + XnQ^siQssT^s = X n z S c 
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Strict dual feasibility. We now demonstrate that for the dual sub- vector 
zsc defined by (33), we have ||£s c ||oo < 1. Using the triangle inequality and 
the mutual incoherence bound (14), we have that 



(34) Halloo ^IIIQ^QSsr 1 ! 



I T/t/ n II II R n 1 1 



An A,- 



II F? n II llT/T/ n II 

1 1 flgc 1 1 oo || "go 1 1 oo 



An A n 

(35) <(l-a) + (2-a 



-R n ||oo ||W^ n ||oo 



An A,- 



Next, applying Lemmas 2 and 4, we have 

\\zs4oo <{l-a) + - + - = l-- 
with probability converging to one. 

Correct sign recovery. We next show that our primal sub-vector 6$ de- 
fined by (24) satisfies sign consistency, meaning that sgn(6>g) = sgn(0£). In 
order to do so, it suffices to show that 

HQ Q* II min 

\\VS - Pglloc < 

recalling the notation 0^ in := min( rt ) g £ |0* t |. From Lemma 3, we have ||0g — 
01? II 2 < 7^— v^An so that 

° ^min 

2 2 

^— ll#s - ^slloo < -^—\\°s - Q*sh 



< -J-TT—^n, 

which is less than one as long as 0* in > 7^-v^An. 

mill ^min 

5. Uniform convergence of sample information matrices. In this section 
we complete the proof of Theorem l by showing that if the dependency (Al) 
and incoherence (A2) assumptions are imposed on the population Fisher 
information matrix then under the specified scaling of (n,p,d), analogous 
bounds hold for the sample Fisher information matrices with probability 
converging to one. These results are not immediate consequences of classi- 
cal random matrix theory (e.g., [II]) since the elements of Q n are highly 
dependent. Recall the definitions 



(36) Q* :=E e *[v(X;e*)X v xT] and Q n := E[rj(X; 9* )X V X$], 
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where Eg* denotes the population expectation, and E denotes the empirical 
expectation, and the variance function n was defined previously in (12). The 
following lemma asserts that the eigenvalue bounds in assumption (Al) hold 
with high probability for sample covariance matrices: 



Lemma 5. Suppose that assumption (Al) holds for the population ma- 
trix Q* and ~Eq*[XX t ]. For any 5 > and some fixed constants A and B, 
we have 



(37a) P 
(37b) 



A r 



> £>max + 5 



-A 5 -^ + B\og(d) 



5 2 n 



P[Amin(Qss) < <?min " 5] < 2 exp ( -A-g- + fllog(d) 



The following result is the analog for the incoherence assumption (A2) 
showing that the scaling of (n,p,d) given in Theorem 1 guarantees that 
population incoherence implies sample incoherence. 



Lemma 6. If the population covariance satisfies a mutual incoherence 
condition (14) with parameter a £ (0,1] as in assumption (A2), then the 
sample matrix satisfies an analogous version, with high probability in the 
sense that 



(38) 



i\Qss) 



>l-2 

2 



< 



exp 



n 



-K-p + log(p) 



Proofs of these two lemmas are provided in the following sections. Before 
proceeding, we take note of a simple bound to be used repeatedly throughout 
our arguments. By definition of the matrices Q n {9) and Q{9) [see (20) and 
(11)], the (j, k)ih element of the difference matrix Q n (0) — Q{6) can be 
written as an i.i.d. sum of the form Zjk = ^ Yli=i ^jk where each Zj^ is 
zero-mean and bounded (in particular, \Z^ \ < 4). By the Azuma-Hoeffding 
bound [15], for any indices j, k = 1, . . . , d and for any e > 0, we have 

(39) F[(Z jk ) 2 >e 2 ]=P 

So as to simplify notation, throughout this section, we use K to denote a 
universal positive constant, independent of (n,p,d). Note that the precise 
value and meaning of K may differ from line to line. 



n 



( 

k 



> £ 



< 2 exp 



e 2 n 
~32" 
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5.1. Proof of Lemma 5. By the Courant -Fischer variational representa- 
tion [16], we have 

Amin(Qss) = min x T Q S sx 

\\x\\2=l 

= min {x T Q n ss x + x T {Qss-Q n ss)x} 

IM|2 = 1 

<y T Qs S y + y T (Qss-Q n ss)y, 

where y G M d is a unit- norm minimal eigenvector of Q'gg. Therefore, we have 

A m in(Q5s) ^ A-mm(Qss) ~ \\\QsS ~ Qsslh ^ ^min — IIIQsS ~ Qsslh- 



Hence it suffices to obtain a bound on the spectral norm \\\Qss — Qsslh- 
Observe that 



/ d d \ 

\7=1 k=l / 



1/2 



Setting e 2 = b 2 jd 2 in (39) and applying the union bound over the d 2 index 
pairs (j, k) then yields 

,5 2 n 



(40) ^LIH^SS 
Similarly, we have 

< 



Qsslh > $} < 2exp( + 21og(d) ) . 



1 



> i=l 



n 



i=l 



>5 



which obeys the same upper bound (40) by following the analogous argu- 
ment. 

5.2. Proof of Lemma 6. We begin by decomposing the sample matrix as 
the sum QscgiQgg)" 1 = T\ + T 2 + T3 + T4 where we define 

(41a) T^Ql^msY 1 -{Q* ss )-\ 

(41b) T 2 :=[Q n ScS -Q* ScS ](Q* ss )-\ 

(41c) T 3 := [Qg c5 - Qs^KQss)- 1 ~ (Qh)' 1 }, 

(41d) Ti-Q'scsWss)- 1 . 

The fourth term is easily controlled; indeed, we have 

III^IHoo = IWQhsiQss)' 1 ^ <l~OL 
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by the incoherence assumption (A2). If we can show that |||Ti Hloc < ^ for the 
remaining indices i = 1,2,3, then by our four term decomposition and the 
triangle inequality, the sample version satisfies the bound (38), as claimed. 
We deal with these remaining terms using the following lemmas: 

Lemma 7. For any 5 > and constants K, K' , the following bounds hold: 



(42a) 



(42b) 



niQscs-QU\\\°c>8] 

( n5 2 \ 

< 2 exp I -K + log(d) + log(p - d) J ; 

n\\Qss-Q*ss\\\oo>5} 

( nb 2 \ 

< 2 exp /- J fC_ + 21og(d) J; 



(42c) 



( n5 2 \ 
<4expf -K-p+K'\og{d) J. 

See Appendix C for the proof of these claims. 

Control of first term. Turning to the first term, we re-factorize it as 

^i — Q*s c s(Q*ss) 1 [Qss ~ Q*ss](Qss) 1 

and then bound it (using the sub-multiplicative property |||^4i3||| < 
|||^4|||oo|||-B|||oo) as follows: 

lll^lllloo < \\\Q*S c s(Q*Ss) 1 \\\oo\\\QsS ~ Q*Ss\\\oo\\\(Qss) ^lloo 

< (1 - a)\\\Q n ss - Qh\\\oo{Vd\\\(Q n ss )~%}, 

where we have used the incoherence assumption (A2). Using the bound (37b) 
from Lemma 5 with 5 = C m i n /2, we have HKQ^s) -1 1||2 = [Amin(Qss)] _1 < 
-p— with probability greater than 1 — exp (-Kn/d 2 + 21og(d)). Next, ap- 

^min 



plying the bound (42b) with 5 = c/y/d, we conclude that with probability 
greater than 1 — 2exp(— Knc 2 /d 3 + log(d)), we have 

\\\Q n ss-Q*ss\\\oo<c/Vd. 

By choosing the constant c > sufficiently small, we are guaranteed that 

(43) PRUT! IIU > a/6] < 2 exp I -K-^ + log(d) 
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Control of second term. To bound T2, we first write 

11^2 HI oo < ^d\\\(Q*ss)~ 1 \h\\\Qs c s - Q*S c s\\\oo 

^ ~F, IIQis^S ~~ Q*S c s\\\oo- 
^min 

We then apply bound (42a) with 5 = f-^p to conclude that 



(44) P[|||T 2 IHoo > a/3] < 2 exp (-K^ + log(p - d)j . 

Control of third term. Finally, in order to bound the third term T3, we 
apply the bounds (42a) and (42b), both with 5= -y/a/3, and use the fact 
that log(cZ) < log(p — d) to conclude that 

(45) P[|||r 3 |||oo > a/3] < 4expf-if + log(p - d)) . 
Putting together all of the pieces, we conclude that 

P[|||^5(«5)" 1 Hloo > 1 - a/2] =o(e X p(-K^ +log(p) 



as claimed. 



6. Experimental results. We now describe experimental results that il- 
lustrate some consequences of Theorem 1, for various types of graphs and 
scalings of (n,p,d). In all cases, we solved the ^-regularized logistic regres- 
sion using special purpose interior-point code developed by Koh, Kim and 
Boyd [20]. 

We performed experiments for three different classes of graphs: four- 
nearest neighbor lattices, (b) eight-nearest neighbor lattices and (c) star- 
shaped graphs as illustrated in Figure 1 . Given a distribution Pg* of the Ising 
form (1), we generated random data sets {x^\ . . . ,x^} by Gibbs sampling 
for the lattice models, and by exact sampling for the star graph. For a given 
graph class and edge strength u> > 0, we examined the performance of mod- 
els with mixed couplings meaning that 9* t = dzu with equal probability or 
with positive couplings meaning that 9* t = to for all edges (s,t). In all cases, 

we set the regularization parameter A constant factor of \ as sug- 
gested by Theorem 1. For any given graph and coupling type, we performed 
simulations for sample sizes n scaling as n = 10/3dlog(p) where the control 
parameter /3 ranged from 0.1 to upwards of 2, depending on the graph type. 

Figure 2 shows results for the 4-nearest-neighbor grid model, illustrated 
in Figure 1(a) for three different graph sizes p G {64, 100,225} with mixed 
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(a) (b) (c) 

Fig. 1. Illustrations of different graph classes used in simulations, (a) Four-nearest neigh- 
bor grid (d = 4). (b) Eight-nearest neighbor grid (d~8). (c) Star-shaped graph [d = Q(p), 

rd = e(iog(p))y. 

couplings [panel (a)] and attractive couplings [panel (b)]. Each curve corre- 
sponds to a given problem size, and corresponds to the success probability 
versus the control parameter /3. Each point corresponds to the average of 
N = 200 trials. Notice how, despite the very different regimes of (n,p) that 
underlie each curve, the different curves all line up with one another quite 
well. This fact shows that for a fixed degree graph (in this case deg = 4) , the 
ratio n/log(p) controls the success/failure of our model selection procedure 
which is consistent with the prediction of Theorem 1. Figure 3 shows anal- 
ogous results for the 8-nearest-neighbor lattice model (d = 8), for the same 



4-nearest neighbor grid (mixed) 4-nearest neighbor grid (attractive) 




Control parameter Control parameter 



(a) (b) 

Fig. 2. Plots of success probability P[A/±(r) = A/"(r),Vr] versus the control parameter 
j3(n,p,d) = n/[10dlog(p)] for Ismg models on 2-D grids with four nearest-neighbor inter- 
actions (d — 4). (a) Randomly chosen mixed sign couplings 6* at = ±0.50. (b) All positive 
couplings 9* t = 0.50. 



22 P. RAVIKUMAR, M. J. WAINWRIGHT AND J. D. LAFFERTY 



8-nearest neighbor grid (mixed) 8-nearest neighbor grid (attractive) 




Control parameter Control parameter 



(a) (b) 

Fig. 3. Plots of success probability P[A/±(r) =jV(r),Vr] versus the control parameter 
P(n,p,d) = n/[10dlog(p)] for lsmg models on 2-D grids with eight nearest-neighbor inter- 
actions (d — 8). (a) Randomly chosen mixed sign couplings 6* st = ±0.25. (b) All positive 
couplings Ott =0.25. 



range of problem size p £ {64, 100, 225} and for both mixed and attractive 
couplings. Notice how once again the curves for different problem sizes are 
all well aligned which is consistent with the prediction of Theorem 1. 

For our next set of experiments, we investigate the performance of our 
method for a class of graphs with unbounded maximum degree d. In par- 
ticular, we construct star-shaped graphs with p vertices by designating one 



Star graph; Logarithmic neighbors Star graph; Linear fraction neighbors 




Control parameter Control parameter 



(a) (b) 

Fig. 4. Plots of success probability ¥[Af±(r) = A/"(r),Vr] versus the control parameter 
P{n,p,d) = n/[10dlog(p)] for star-shaped graphs for attractive couplings with (a) logarith- 
mic growth in degrees, (b) linear growth in degrees. 
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node as the hub and connecting it to d < (p — 1) of its neighbors. For linear 
sparsity, we choose d= |~0.1p], whereas for logarithmic sparsity we choose 
d = [log(p)] . We again study a triple of graph sizes p 6 {64, 100, 225}, and 
Figure 4 shows the resulting curves of success probability versus control pa- 
rameter /3 = n/[Wdlog(p)]. Panels (a) and (b) correspond, respectively, to 
the cases of logarithmic and linear degrees. As with the bounded degree 
models in Figure 2 and 3, these curves align with one another showing a 
transition from failure to success with probability one. 

Although the purpose of our experiments is mainly to illustrate the con- 
sequences of Theorem 1, we also include a comparison of our nodewise l\- 
penalized logistic regression-based method to two other graph estimation 
procedures. For the comparison, we use a star-shaped graph as in the pre- 
vious plot, with one node designated as the hub connected to d = [O.lp] of 
its neighbors. It should be noted that among all graphs with a fixed total 
number of edges, this class of graphs is among the most difficult for our 
method to estimate. Indeed, the sufficient conditions of Theorem 1 scale 
logarithmically in the graph size p but polynomially in the maximum degree 
d; consequently, for a fixed total number of edges, our method requires the 
most samples when all the edges are connected to the same node, as in a 
star-shaped graph. 

For comparative purposes, we also illustrate the performance of the PC 
algorithm of Spirtes, Glymour and Schemes [30] as well as the maximum 
weight tree method of Chow and Liu [7]. Since the star graph is a tree (cycle- 
free), both of these methods are applicable in this case. The PC algorithm 
is targeted to learning (equivalence classes of) directed acyclic graphs, and 
consists of two stages. In the first stage it starts from a completely connected 
undirected graph, and iteratively removes edges based on conditional inde- 
pendence tests so that at the end of this stage it is left with an undirected 
graph which is called a skeleton. In the second stage, it partially directs some 
of the edges in the skeleton so as to obtain a completed partially directed 
acyclic graph which corresponds to an equivalence class of directed acyclic 
graphs. As pointed out by Kalisch and Buhlmann [18], for high-dimensional 
problems, the output of the first stage, which is the undirected skeleton 
graph, could provide a useful characterization of the dependencies in the 
data. Following this suggestion, we use the skeleton graph determined by 
the first stage of the PC algorithm as an estimate of the graph structure. 
We use the pcalg R-package [18] as an implementation of the PC algorithm 
which uses partial correlations to test conditional independencies. 

The Chow-Liu algorithm [7] is a method for exact maximum likelihood 
structure selection which is applicable to the case of trees. More specifically, 
it chooses, from among all trees with a specified number of edges, the tree 
that minimizes the Kullback-Leibler divergence to the empirical distribution 
defined by the samples. From an implementational point of view, it starts 
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Fig. 5. Plots of edge disagreement E[|{(s,t) | E st 7^ -Ej t }|] versus the control parameter 
P(n,p,d) = n/[10dlog(p)] for star-shaped graphs where the hub node has degree d = 0(p)- 
The results here are shown for attractive couplings with 8* t = 0.25 for all edges (s,i) 
belonging to the edge set. The i\ -penalized logistic regression method (LI), the PC method 
(PC) and the maximum weight forest method of Chow and Liu (CL) are compared for 
p = 64. 



with a completely connected weighted graph with edge weights equal to 
the empirical mutual information between the incident node variables of 
the edge and then computes its maximum weight spanning tree. Since our 
underlying model is a star-shaped graph with fewer than (p — 1) edges, a 
spanning tree would necessarily include false positives. We thus estimate the 
maximum weight forest with d edges instead where we supplied the number 
of edges d in the true graph to the algorithm. 

Figure 5 plots, for the three methods, the total number of edge dis- 
agreements between the estimated graphs and the true graph versus the 
control parameter (3 = n/[10d\og(p)]. Even though this class of graphs is 
especially challenging for a neighborhood-based method, the ^-penalized 
logistic-regression based method is competitive with the Chow-Liu algo- 
rithm, and except at very small sample sizes, it performs better than the 
PC algorithm for this problem. 

7. Extensions to general discrete Markov random fields. Our method 
and analysis thus far has been specialized to the case of the binary pairwise 
Markov random fields. In this section, we briefly outline the extension to 
the case of general discrete pairwise Markov random fields. (Recall that for 
discrete Markov random fields, there is no loss of generality in assuming 
only pairwise interactions since by introducing auxiliary variables, higher- 
order interactions can be reformulated in a pairwise manner [34].) Let X = 
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(Xi, . . . ,X p ) be a random vector, each variable Xi taking values in a set X 
of cardinality m, say X = {1,2, . . . , m}. Let G = (V, E) denote a graph with 
p nodes corresponding to the p variables {X\, . . . ,X p }, and let {(j) s :X— > 
R, s E V} and {4> s t : X x X — > R, (s, i) E E 1 }, respectively, denote collections 
of potential functions associated with the nodes and edges of the graph. 
These functions can be used to define a pairwise Markov random field over 
{X\ , . . . , X p ) , with density 

(46) F(x) (xex-pl^2<t) s (x s ) + ^ (j) st (x s ,x t )\. 

^sev (s,t)eE > 

Since X is discrete, each potential function <f> s t can be parameterized as 
linear combinations of {0, l}-valued indicator functions. In particular, for 
each s E V and j E {1, . . . , m — 1}, we define 



l[Xs=j] 



1, ifx s =j, 
0, otherwise. 



Note we omit an indicator for x s = m from the list, since it is redundant 
given the indicators for j = 1, . . . , m — 1. In a similar fashion, we define the 
pairwise indicator functions I[x s = j, Xt = k], for (j,k) € {1, 2, . . . , m — l} 2 . 
Any set of potential functions can then be written as 

Mx s )= Yl e *s-A*s=j] forsev, 

ie{l,...,m-l} 

and 

<t> st {x s ,x t ) = ^2 6*st;jk I [ x s=j,x t = k] for (s,i) eE. 
(i,fc)e{i,...,m-i} 2 

Overall, the Markov random field can be parameterized in terms of the 
vector 9* £ R m_1 for each s G V, and the vector 6* t E R( m_1 ^ associated 
with each edge. In discussing graphical model selection, it is convenient to 
associate a vector 6^ E R^" 1-1 ) to every pair of distinct vertices (u, v) with 
the understanding that 6* v = if (u, v) ^ E. 

With this set-up, we now describe a graph selection procedure that is 
the natural generalization of our procedure for the Ising model. As before 
we focus on recovering for each vertex r E V its neighborhood set and then 
combine the neighborhood sets across vertices to form the graph estimate. 

For a binary Markov random field (1), there is a unique parameter 8* t 
associated with each edge (r, t) E E. For m-ary models, in contrast, there 
is a vector 9* t E R( m_1 ) of parameters associated with any edge (r, t). In 
order to describe a recovery procedure for the edges, let us define a matrix 
0* r E R( m_1 ) x (p -1 ) where column u is given by the vector 6* u . Note that 
unless vertex r is connected to all of its neighbors, many of the matrix 
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columns are zero. In particular, the problem of neighborhood estimation 
for vertex r corresponds to estimating the column support of the matrix 
0* r — that is, 

Af(r) = {ueV\{r}\\\0* ru \\ 2 ^0}. 

In order to estimate this column support, we consider the conditional 
distribution of X r given the other variables ^\{r} = {Xt\t £ V \ {r}}. For 
a binary model, this distribution is of the logistic form while for a general 
pairwise MRF, it takes the form 

(47) r [Xr —j X\ r — x v \ - ^ eMe; ^ + v[xt = k]y 

Thus, X r can be viewed as the response variable in a multiclass logistic re- 
gression in which the indicator functions associated with the other variables, 

{i[x t = k],tev\ W, k e {i, 2, . . . , m - 1}}, 

play the role of the covariates. 

Accordingly, one method of recovering the row support of 0* r is by per- 
forming multiclass logistic regression of X r on the rest of the variables X\ r 
using a block £2/^1 penalty of the form 

|||€>\rll|2,l := ^2 \\dru\\2- 
ueV\{r} 

More specifically, let 36" = {x^\ . . . ,x*- n )} denote an i.i.d. set of n samples, 
drawn from the discrete MRF (46). In order to estimate the neighborhood 
of node r, we solve the following convex program: 

(48) min {£(e v ;X?) + A n |||8\ r ||| 2 ,i}, 

e\ r .g]R( m - 1 > 2 x(p- 1 ) 

where £(@\ r ; := ^ Y17=i l°g ^6 | £\ r ] is the rescaled multiclass logis- 
tic likelihood defined by the conditional distribution (47), and A n > is a 
regularization parameter. 

The convex program (48) is the multiclass logistic analog of the group 
Lasso, a type of relaxation that has been studied in previous and on-going 
work on linear and logistic regression (e.g., [19, 22, 25, 38]). It should be 
possible to extend our analysis from the preceding sections so as to ob- 
tain similar high-dimensional consistency rates for this multiclass setting; 
the main difference is the slightly different sub-differential associated with 
the block £2/^1 norm. See Obozinski, Wainwright and Jordan [25] for some 
related work on support recovery using £2/^1 block-regularization for multi- 
variate linear regression. 
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8. Conclusion. We have shown that a technique based on ^-regularized 
logistic regression can be used to perform consistent model selection in bi- 
nary Ising graphical models, with polynomial computational complexity and 
sample complexity logarithmic in the graph size. Our analysis applies to the 
high-dimensional setting, in which both the number of nodes p and maxi- 
mum neighborhood sizes d are allowed to grow as a function of the number 
of observations n. Simulation results show the accuracy of these theoretical 
predictions. For bounded degree graphs, our results show that the structure 
can be recovered with high probability once n/\og{p) is sufficiently large. Up 
to constant factors, this result matches known information-theoretic lower 
bounds [29]. Overall, our experimental results are consistent with the conjec- 
ture that logistic regression procedure fails with high probability for sample 
sizes n that are smaller than (D(dlogp). It would be interesting to prove 
such a converse result, to parallel the known upper and lower thresholds for 
success/failure of ^-regularized linear regression, or the Lasso (see [33]). 

As discussed in Section 7, although the current analysis is applied to 
binary Markov random fields, the methods of this paper can be extended 
to general discrete graphical models with a higher number of states using a 
multinomial likelihood and some form of block regularization. It should also 
be possible and would be interesting to obtain high-dimensional rates in this 
setting. A final interesting direction for future work is the case of samples 
drawn in a non-i.i.d. manner from some unknown Markov random field; 
we suspect that similar results would hold for weakly dependent sampling 
schemes. 

APPENDIX A: PROOF OF UNIQUENESS LEMMA 

In this appendix, we prove Lemma 1. By Lagrangian duality, the penalized 
problem (7) can be written as an equivalent constrained optimization prob- 
lem over the ball < C(A n ), for some constant C(X n ) < +oo. Since the 
Lagrange multiplier associated with this constraint — namely, A n — is strictly 
positive, the constraint is active at any optimal solution so that ||#||i is 
constant across all optimal solutions. 

By the definition of the £i-subdifferential, the subgradient vector z can 
be expressed as a convex combination of sign vectors of the form 

(49) z= Yl a v v > 

where the weights a v are nonnegative and sum to one. In fact, these weights 
correspond to an optimal vector of Lagrange multipliers for an alternative 
formulation of the problem in which a v is the Lagrange multiplier for the 
constraint {v,0) < C(A n ). From standard Lagrangian theory [3], it follows 
that any other optimal primal solution 8 must minimize the associated 
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Lagrangian — or equivalently, satisfy (21) — and moreover must satisfy the 
complementary slackness conditions a v {{v,8) — C} = for all sign vectors 
v. But these conditions imply that (z, 0) = C = \\9\\i which cannot occur if 
6j 7^ for some index j for which \zj \ < 1. We thus conclude that 9$ c = for 
all optimal primal solutions. 

Finally, given that all optimal solutions satisfy 9$ c = 0, we may consider 
the restricted optimization problem subject to this set of constraints. If the 
principal submatrix of the Hessian is positive definite, then this sub-problem 
is strictly convex so that the optimal solution must be unique. 

APPENDIX B: PROOFS FOR TECHNICAL LEMMAS 

In this section, we provide proofs of Lemmas 2, 3 and 4, previously stated 
in Section 4. 



B.l. Proof of Lemma 2. Note that any entry of W n has the form W™ 
n SILi where for i = 1, 2, . . . , n, the variables 



y{i) ._ T Wf r (i) 



.[x r = l\x®]+F e *[x r = -l\x®}} 



are zero-mean under Pg*, i.i.d. and bounded < 2). Therefore, by the 

Azuma-Hoeffding inequality [15], we have, for any 5 > 0, P[|W"| > S] < 
2exp(-^). Setting 5 = tttt^y, we obtam 



4(2-ce) ' 

^\w n \>- 



< 2exp 



2\2 



a 2 \ 



128(2 - a) 2 

Finally, applying a union bound over the indices u of W n yields 



a . 



> 



< 2exp 



2 \2 



128(2 - a) 



r n + log(p) 



as claimed. 



B.2. Proof of Lemma 3. Following a method used in a different context 
by Rothman et al. [28] , we define the function G : W 1 — > R by 

(50) G(u s ) := £(91 + u s ; X?) - £(9 S ; £?) + A n (||^ + u s \\- \\9* s \\). 

It can be seen from (24) that u = 9$ — 9* s minimizes G. Moreover, G(0) = 
by construction; therefore, we must have G(u) < 0. Note also that G is 
convex. Suppose that we show that for some radius B > 0, and for u £ P^ 
with 1 1 tz, 1 1 2 = B, we have G(u) > 0. We then claim that \\u\\2 < B. Indeed, if u 
lay outside the ball of radius B, then the convex combination tu+ (1 — 1)(0) 
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would lie on the boundary of the ball, for an appropriately chosen t S (0, 1). 
By convexity, 

G(tu + (1 - t)(0)) < tG{u) + (1 - t)G(0) < 0, 

contradicting the assumed strict positivity of G on the boundary. 

It thus suffices to establish strict positivity of G on the boundary of the 
ball with radius B = M\ n y/d where M > is a parameter to be chosen later 
in the proof. Let u £ M. d be an arbitrary vector with ||u||2 = B. Recalling the 
notation W = V£(#*; X"), by a Taylor series expansion of the log likelihood 
component of G, we have 

(51) G(u) = Wju + u T [V 2 £(6* s + au)]u + X n (\\9* s + u s \\- \\0* s \\) 
for some a € [0, 1] . For the first term, we have the bound 



(52) 



\Wju\ < HWsllooNU < ||Ws||ooVd||u|| 2 < (\ n Vd) 



2 M 



since ||Ws||oo < \- by assumption. 

Applying the triangle inequality to the last term in the expansion (51) 
yields 

An||#S + v>s\\x — \\0s\\l > — A n ||us||i. 
Since < V"||t*sr[|2j we have 



(53) 



A 



71 1| PS 



+ ush - \\e* s \\i > -XnVd\\u s \\ 2 = -M(Vd\ n ) 2 . 



Finally, turning to the middle Hessian term, we have 
q* :=A min (V 2 £(9* s + au;X?)) 
> min A min (V 2 £{9* s + 

a6[0,l] 



min A r 

aS[0,l] 



- V (x^ ; 9* s + au s )xf (xff 



i=l 



By a Taylor series expansion of rj(x^; •), we have 



7*>A„ 



i=l 
1 



max 

oe[o,i] 



n 

J2v'^;9* s + au s )(u T sX f)xf(xff 



i=l 



Kmn(Qss) ~ max 
q£[0,1] 



rf (x W ; 0* s + au s ) ( (u s , xf ))xf (xf f 



i=l 
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>c n 



max 
ae[o,i] 



i=l 



A(a) 



It remains to control the spectral norm of the matrices A(a), for a € [0, 1]. 
For any fixed a £ [0, 1] and y £~M, d with \\y\\2 = 1, we have 

(y,A(a)y) = -J2 v' ; 0*s + <™s) [(us, xf )} [(xf , y)} 2 



i=l 
n 



<-J2W^;9* s + au s )\\(u s ,xf)\[(xf,y)] 2 . 
i=i 



Now note that \r/(x^',6 s + aus)\ < 1, and 



\(u s ,xf)\ <Vd\\u s h = MX n d. 



Moreover, we have 



E«4 i} >y» 2 < 



r? 



i=l 



i=i 



by assumption. Combining these pieces, we obtain 

max P(q)||| 2 < D max MX n d < C min /2, 

ae[0,l] 

assuming that A n < 2 MD hl g - We ver ify this condition momentarily, after 
we have specified the constant M. 

Under this condition, we have shown that 



(54) 



q* := A min (V 2 £(8 s + cm; > C min /2. 



Finally, combining the bounds (52), (53), and (54) in the expression (51), 
we conclude that 

G(u s ) > (A n x/d) 2 {- \M + ^M 2 - m}. 

This expression is strictly positive for M = 5/C m i n . Consequently, as long 
as 



An< 



2MD max d WD max d 
as assumed in the statement of the lemma, we are guaranteed that 



PS 2 



< MX n Vd = — — X n Vd 



as claimed. 
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B.3. Proof of Lemma 4. We first show that the remainder term R n sat- 
isfies the bound ||-R n ||oo < D miix \\9s — ^5 Hi- Then the result of Lemma 3 — 
namely, that \\9s — 9*s\\2 < — can be used to conclude that 



'S 2 — 

lR n ll 9^n 

\ 1L 1 1 00 . il,i/ m£H 



as claimed in Lemma 4. 

Focusing on element R™ for some index j G {1, . . . ,p}, we have 

i?!? = [V 2 ^^; x) - V 2 £{9* ; - 9*} 

1 n 



i=l 



for some point 9^ = tjO + (1 — tj)9*. Setting g{t) = [!^exp(2f)] a ' no * e ^at 
rj(x;9) = g(x r ^2 t& v\r@rtXt)- By the chain rule and another application of 
the mean value theorem, we then have 

1 n 

r] = - Y^tf(8 uyr x®)(x i<) ) T [o (i) - e*]{xf(x^f[e- e*}} 

n 1=1 
1 n 

= L Y^{g'{^ T x^)xf}{[9^ - 9*] T x^(x^f[9- 9*}}, 



i=l 



where 9^ is another point on the line joining 9 and 9*. Setting cij 
{g'(9^ T x x®)xf} and h := {[0® - 9*] T x^ {x^) T [9 - 9*}}, we have 



n 



i=l 



< - \\a\ oo| 6| 1. 
n 



A calculation shows that Halloo < 1, and 



hb\\ 1 =t j [9-9r\l-t,^ ) ^ ) f)^ 

I i=l J 



i=l 



where the second line uses the fact that 9$? = 9$ c = 0. This concludes the 
proof. 
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APPENDIX C: PROOF OF LEMMA 7 

Recall from the discussion leading up to the bound (39) that element 
(J, k) of the matrix difference Q n — Q* , denoted by Zjk, satisfies a sharp tail 
bound. By definition of the ^-matrix norm, we have 



"[\\\Q n ^s-Qhs\\\oo>S]=¥ 



max 



^2\ z jk\ > $ 



fees 



< (p - d)I 



J2\ z jk\ > $ 



fees 



where the final inequality uses a union bound, and the fact that \S C \ <p — d. 
Via another union bound over the row elements 



fees 



<F[3k€ S\\Z jk \ > 5/d] 
<dP[\Z jk \>6/d}; 



we then obtain 



niQs°S - QhslWoo >5]<(p- d)d¥[\Z jk \ > 5/d] 

from which the claim (42a) follows by setting e = 5/d in the Hoeffding 
bound (39). The proof of bound (42b) is analogous with the pre-factor (p — d) 
replaced by d. 

To prove the last claim (42c), we write 

IIKQss) 1 ~ (Q*ss) 1 III oo = IIKQss) 1 [Q*ss — QssKQss) 1 llloo 

^VdUQlsr'iQh-QssKQhrX 
< VdUQ* ss )~%\\\Q* ss - Q n S sh\\\(Q n ss)-% 

<^\\\Qh-Qss\h\\\(Q n ssr%- 
From the proof of Lemma 5, in particular equation (40), we have 



n \ — 1 1 
SS) 



> 



5 2 n 

<2exp( -K-^- + B\og{d) 



for a constant B. Moreover, from (40), we have 

n\\Qss - Qssh > S/Vd\ < 2exp(-K^ + 21og(d)) 
so that the bound (42c) follows. 
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